Suppose we have a report and we want to find the sentences that are talking about numerical things....
Originally inspired by "When you get data in sentences: how to use a spreadsheet to extract numbers from phrases" by Paul Bradshaw on the Online Journalism blog, from which some of the example sentences (sic!) are taken.
Distribution: https://twitter.com/paulbradshaw/status/1158752556958519297
quantulum: extract quantities from natural language text;
ctparse: extract time / date related quantities from natural language text;
r1chardj0n3s/parse: easy scrape / regex extraction from semi-structured text using format() like patterns; example use here;
dateparser [docs]: "easily parse localized dates in almost any string formats commonly found on web pages" (includes foreign language detection);
In [152]:
sentences = [
    '4 years and 6 months’ imprisonment with a licence extension of 2 years and 6 months',
    'No quantities here',
    'I measured it as 2 meters and 30 centimeters.',
    "four years and six months' imprisonment with a licence extension of 2 years and 6 months",
    'it cost £250... bargain...',
    'it weighs four hundred kilograms.',
    'It weighs 400kg.',
    'three million, two hundred & forty, you say?',
    'it weighs four hundred and twenty kilograms.'
]
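As a rough baseline before reaching for any of the packages above, a quick stdlib regex sketch can already pull out simple digit-based amounts from these sentences. It deliberately won't catch spelled-out numbers like "four hundred" — that's what the libraries are for.

```python
import re

# Match an optional £ sign, an integer or decimal, then an optional
# unit-ish word (e.g. '400kg', '2 meters', '£250').
# A crude baseline only: it misses spelled-out numbers.
AMOUNT = re.compile(r'£?\d+(?:\.\d+)?\s*[a-zA-Z]*')

def rough_amounts(sent):
    """Return all digit-based amount-like substrings in a sentence."""
    return AMOUNT.findall(sent)

rough_amounts('It weighs 400kg.')  # ['400kg']
```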
quantulum3
quantulum3 is a Python package "for information extraction of quantities from unstructured text".
In [153]:
#!pip3 install quantulum3
from quantulum3 import parser
In [154]:
for sent in sentences:
    print(sent)
    p = parser.parse(sent)
    if p:
        print('\tSpoken:', parser.inline_parse_and_expand(sent))
        print('\tNumeric elements:')
        for q in p:
            display(q)
            print('\t\t{} :: {}'.format(q.surface, q))
    print('\n---------\n')
In [155]:
import spacy
nlp = spacy.load('en_core_web_lg', disable = ['ner'])
In [171]:
text = '''
Once upon a time, there was a thing. The thing weighed forty kilogrammes and cost £250.
It was blue. It took forty five minutes to get it home.
What a day that was. I didn't get back until 2.15pm. Then I had some cake for tea.
'''
In [172]:
doc = nlp(text)
for sent in doc.sents:
    print(sent)
In [173]:
for sent in doc.sents:
    sent = sent.text
    p = parser.parse(sent)
    if p:
        print('\tSpoken:', parser.inline_parse_and_expand(sent))
        print('\tNumeric elements:')
        for q in p:
            display(q)
            print('\t\t{} :: {}'.format(q.surface, q))
    print('\n---------\n')
In [1]:
url = 'https://raw.githubusercontent.com/BBC-Data-Unit/unduly-lenient-sentences/master/ULS%20for%20Sankey.csv'
In [2]:
import pandas as pd
df = pd.read_csv(url)
df.head()
Out[2]:
In [178]:
#get a row
df.iloc[1]
Out[178]:
In [179]:
#and a, erm, sentence...
df.iloc[1]['Original sentence (refined)']
Out[179]:
In [180]:
parser.parse(df.iloc[1]['Original sentence (refined)'])
Out[180]:
In [206]:
def amountify(txt):
    #txt may be some flavour of nan...
    #handle scruffily for now...
    try:
        if txt:
            p = parser.parse(txt)
            x = []
            for q in p:
                x.append('{} {}'.format(q.value, q.unit.name))
            return '::'.join(x)
        return ''
    except:
        return
In [207]:
df['amounts'] = df['Original sentence (refined)'].apply(amountify)
In [208]:
df.head()
Out[208]:
We could then do something to split multiple amounts into multiple rows or columns...
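One way to do that split, as a sketch: break the `::`-joined string back into a list and use pandas' `explode()` to give each amount its own row. The mini-frame below is a hypothetical stand-in for the `amounts` column built above.

```python
import pandas as pd

# Hypothetical stand-in for the df['amounts'] column built by amountify()
tmp = pd.DataFrame({'amounts': ['4.0 year::6.0 month', '', '400.0 kilogram']})

# One amount per row: split on the '::' separator, then explode the lists
long_form = (tmp.assign(amount=tmp['amounts'].str.split('::'))
                .explode('amount'))
```

A pivot on the exploded frame would get multiple columns per row instead, if that shape is more useful downstream.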
The sentencing sentences look to have a reasonable degree of structure to them (or at least, there are some commonalities in the way some of them are structured).
We can exploit this structure by writing some more specific pattern matches to pull out even more information.
In [6]:
df['Original sentence (refined)'][:20].apply(print);
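For instance, one of the recurring shapes is "N years and N months' imprisonment". A sketch of a more specific pattern match for that shape (the function name and the conversion to months are my own additions, and it handles both straight and curly apostrophes):

```python
import re

# "4 years and 6 months' imprisonment" -> total term in months.
# Handles straight (') and curly (’) apostrophes; a sketch, not a
# complete grammar for these sentences.
TERM = re.compile(r"(\d+) years? and (\d+) months?['’] imprisonment")

def term_in_months(sent):
    """Return the custodial term in months, or None if no match."""
    m = TERM.search(sent)
    if m:
        return int(m.group(1)) * 12 + int(m.group(2))
    return None

term_in_months("4 years and 6 months’ imprisonment "
               "with a licence extension of 2 years and 6 months")  # 54
```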
It makes sense to try to build a default hierarchy that extracts from more specific to less specific structures...
For example:
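One way to frame such a hierarchy, as a sketch: hold an ordered list of (pattern, label) pairs, most specific first, and return the label of the first pattern that matches. The patterns below are illustrative placeholders, not rules derived from the actual data.

```python
import re

# Most specific first; the first pattern that matches wins.
# These patterns are illustrative placeholders only.
PATTERNS = [
    (re.compile(r"\d+ years? and \d+ months['’] imprisonment"), 'term_years_months'),
    (re.compile(r"\d+ years['’]? imprisonment"), 'term_years'),
    (re.compile(r"\d+"), 'bare_number'),
]

def classify(sent):
    """Label a sentence with the most specific matching pattern."""
    for pattern, label in PATTERNS:
        if pattern.search(sent):
            return label
    return 'no_match'
```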